Median K-Flats for Hybrid Linear Modeling with Many Outliers
Abstract

We describe the Median K-flats (MKF) algorithm, a simple online method
for hybrid linear modeling, i.e., for approximating data by a mixture of
flats. This algorithm simultaneously partitions the data into clusters
while finding their corresponding best approximating $\ell_1$ d-flats,
so that the cumulative $\ell_1$ error is minimized. The current
implementation restricts d-flats to be d-dimensional linear subspaces.
It requires a negligible amount of storage, and its complexity, when
modeling data consisting of N points in $\mathbb{R}^D$ with K
d-dimensional linear subspaces, is of order
$O(n_s \cdot K \cdot d \cdot D + n_s \cdot d^2 \cdot D)$, where $n_s$ is
the number of iterations required for convergence (empirically on the
order of $10^4$). Since it is an online algorithm, data can be supplied
to it incrementally and it can incrementally produce the corresponding
output. The performance of the algorithm is carefully evaluated using
synthetic and real data.



Supp. webpage: http://www.math.umn.edu/~lerman/mkf/

1. Introduction 

Many common data sets can be modeled by mixtures of 
flats (i.e., affine subspaces). For example, feature vectors of 
different moving objects in a video sequence lie on differ- 
ent affine subspaces (see e.g., [14]), and similarly, images 
of different faces under different illuminating conditions are 
on different linear subspaces with each such subspace cor- 
responding to a distinct face [1]. Such data give rise to the 
problem of hybrid linear modeling, i.e., modeling data by a 
mixture of flats. 

Different kinds of algorithms have been suggested for this problem
utilizing different mathematical theories. For example, Generalized
Principal Component Analysis (GPCA) [21] is based on algebraic geometry,
Agglomerative Lossy Compression (ALC) [13] uses information theory, and
Spectral Curvature Clustering (SCC) [4] uses multi-way clustering
methods as well as multiscale geometric analysis. On the other hand,
there are also some heuristic approaches, e.g., Subspace Separation
[5, 11, 12] and Local Subspace Affinity (LSA) [23]. Probably the most
straightforward method of all is the K-flats (KF) algorithm or any of
its variants [10, 17, 3, 20, 8].

The K-flats algorithm aims to partition a given data set
$X = \{x_1, \ldots, x_N\} \subset \mathbb{R}^D$ into K subsets
$X_1, \ldots, X_K$, each of which is well approximated by its best fit
d-flat. More formally, given parameters K and d, the algorithm tries to
minimize the objective function

$$\sum_{i=1}^{K} \min_{d\text{-flats } L_i} \sum_{x_j \in X_i} \mathrm{dist}^2(x_j, L_i). \tag{1}$$
In practice, the minimization of this function is performed iteratively
as in the K-means algorithm [15]. That is, after an initialization of K
d-flats (for example, they may be chosen randomly), one repeats the
following two steps until convergence: 1) Assign clusters according to
minimal distances to the flats determined at the previous stage.
2) Compute least squares d-flats for these newly obtained clusters by
Principal Component Analysis (PCA).
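
The two alternating steps can be sketched as follows. This is only a minimal illustration, assuming NumPy, flats restricted to linear subspaces, a random orthonormal initialization, and a fixed cap on the number of sweeps; the function name `k_flats` and its parameters are hypothetical and are not taken from the implementations cited in this paper.

```python
import numpy as np

def k_flats(X, K, d, n_iter=50, seed=0):
    """Sketch of K-flats for linear subspaces; X is N x D with points as rows."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Assumed initialization: K random d x D matrices with orthonormal rows.
    P = [np.linalg.qr(rng.standard_normal((D, d)))[0].T for _ in range(K)]
    labels = np.full(N, -1)
    sq_norm = np.sum(X ** 2, axis=1)
    for _ in range(n_iter):
        # Step 1: squared distance to subspace i is ||x||^2 - ||P_i x||^2.
        dist2 = np.stack(
            [sq_norm - np.sum((X @ Pi.T) ** 2, axis=1) for Pi in P], axis=1
        )
        new_labels = np.argmin(dist2, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        # Step 2: least squares subspace of each cluster via PCA
        # (top d right singular vectors of the cluster's data matrix).
        for i in range(K):
            Xi = X[labels == i]
            if len(Xi) >= d:
                _, _, Vt = np.linalg.svd(Xi, full_matrices=False)
                P[i] = Vt[:d]
    return labels, P
```

Clusters with fewer than d points simply keep their previous basis in this sketch; how such degenerate clusters should be handled is a design choice the text does not specify.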

This procedure is very fast and is guaranteed to converge to at least a
local minimum. However, in practice, the local minimum it converges to
is often significantly worse than the global minimum. As a result, the
K-flats algorithm is not as accurate as more recent hybrid linear
modeling algorithms, and even in the case of underlying linear subspaces
(as opposed to general affine subspaces) it often fails when either d is
sufficiently large (e.g., d > 10) or there is a large component of
outliers.
This paper has two goals. The first one is to show that in order to
significantly improve the robustness to outliers and noise of the
K-flats algorithm, it is sufficient to replace its objective function
(Eq. (1)) with

$$\sum_{i=1}^{K} \min_{d\text{-flats } L_i} \sum_{x_j \in X_i} \mathrm{dist}(x_j, L_i), \tag{2}$$
that is, replacing the $\ell_2$ average with an $\ell_1$ average. The
second goal is to establish an online algorithm for this purpose, so
that data can be supplied to it incrementally, one point at a time, and
it can incrementally produce the corresponding output. We believe that
an online procedure, which has to be very different from K-flats, can
also be beneficial for standard settings of moderate-size data which is
not streaming. Indeed, it is possible that such a strategy will converge
more often to the global minimum of the $\ell_1$ error than the
straightforward $\ell_1$ generalization of K-flats (assuming an accurate
algorithm for computing best $\ell_1$ flats).

In order to address those goals we propose the Median K-flats (MKF)
algorithm. We chose this name since in the special case where d = 0 the
well-known K-medians algorithm (see e.g., [9]) approximates the minimum
of the same energy function. The MKF algorithm employs a stochastic
gradient descent strategy [2] in order to provide an online
approximation for the best $\ell_1$ d-flats. Its current implementation
only applies to the setting of underlying linear subspaces (and not
general affine ones).

Numerical experiments with synthetic and real data indicate superior
performance of the MKF algorithm in various instances. In particular, it
outperforms some standard algorithms in the cases of large outlier
component or relatively large intrinsic dimension of flats. Even on the
Hopkins 155 Database for motion segmentation [19], which requires small
intrinsic dimensions, has little noise, and few outliers, the MKF
performs very well and in particular better than K-flats. We speculate
that this is because the iterative process of MKF converges more often
to a global minimum than that of K-flats.

The rest of this paper is organized as follows. In Sec- 
tion 2 we introduce the MKF algorithm. Section 3 carefully 
tests the algorithm on both artificial data of synthetic hybrid 
linear models and real data of motion segmentation in video 
sequences. Section 4 concludes with a brief discussion and 
mentions possibilities for future work. 



2. The MKF algorithm 

We introduce here the MKF algorithm and estimate its 
storage and running time. We then discuss some technical 
details of our implementation. 



2.1. Description of algorithm 

The MKF algorithm partitions a data set
$X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^D$ into K clusters
$X_1, X_2, \ldots, X_K$, with each cluster approximated by a
d-dimensional linear subspace.

We start with a notational convention for linear subspaces. For each
$1 \le i \le K$, let $P_i$ be the $d \times D$ matrix whose rows form an
orthonormal basis of the linear subspace approximating $X_i$, and note
that $P_i P_i^T = I_{d \times d}$. We identify the approximating
subspaces of clusters $X_1, \ldots, X_K$ with the matrices
$P_1, \ldots, P_K$.

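We define the following energy function for the partition
$\{X_i\}_{i=1}^K$ and the corresponding subspaces $\{P_i\}_{i=1}^K$:

$$\mathcal{E}\left(\{X_i\}_{i=1}^K, \{P_i\}_{i=1}^K\right) = \sum_{i=1}^{K} \sum_{x \in X_i} \left\| x - P_i^T P_i x \right\|. \tag{3}$$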



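The MKF algorithm tries to partition the data into clusters
$\{X_i\}_{i=1}^K$ minimizing the above energy. Since the underlying
flats are linear subspaces, we can normalize the elements of X to lie on
the unit sphere, so that $\|x_j\| = 1$ for each $1 \le j \le N$, and
express the energy function $\mathcal{E}$ as follows:

$$\mathcal{E}\left(\{X_i\}_{i=1}^K, \{P_i\}_{i=1}^K\right) = \sum_{i=1}^{K} \sum_{x \in X_i} \left\| x - P_i^T P_i x \right\| = \sum_{i=1}^{K} \sum_{x \in X_i} \sqrt{1 - \|P_i x\|^2}. \tag{4}$$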
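
The step from the residual norm to the square-root form can be spelled out as a short derivation. It uses only the two facts already stated in the text, $P_i P_i^T = I_{d \times d}$ and $\|x\| = 1$:

```latex
\begin{aligned}
\left\| x - P_i^T P_i x \right\|^2
  &= \|x\|^2 - 2\, x^T P_i^T P_i x + x^T P_i^T \left(P_i P_i^T\right) P_i x \\
  &= \|x\|^2 - 2\, \|P_i x\|^2 + \|P_i x\|^2 \\
  &= 1 - \|P_i x\|^2 .
\end{aligned}
```

Taking square roots gives the single-point term of the energy in its normalized form.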

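To minimize this energy, the MKF algorithm uses the method of stochastic
gradient descent [2]. The derivative of the energy with respect to a
given matrix $P_i$ is

$$\frac{\partial \mathcal{E}}{\partial P_i} = -\sum_{x \in X_i} \frac{P_i x x^T}{\sqrt{1 - \|P_i x\|^2}}. \tag{5}$$

The algorithm needs to adjust $P_i$ according to the component of the
derivative orthogonal to $P_i$. The part of the derivative that is
parallel to the subspace $P_i$ is

$$\frac{\partial \mathcal{E}}{\partial P_i} P_i^T P_i = -\sum_{x \in X_i} \frac{P_i x x^T P_i^T P_i}{\sqrt{1 - \|P_i x\|^2}}. \tag{6}$$

Hence the orthogonal component is

$$dP_i = \sum_{x \in X_i} d_x P_i, \tag{7}$$

where

$$d_x P_i = -\frac{P_i x x^T - P_i x x^T P_i^T P_i}{\sqrt{1 - \|P_i x\|^2}}. \tag{8}$$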
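
As a sanity check on the derivative of the energy, the single-point term $\sqrt{1 - \|P x\|^2}$ can be differentiated numerically and compared with the closed form $-P x x^T / \sqrt{1 - \|P x\|^2}$. The sketch below assumes NumPy; the helper names, the dimensions D = 5 and d = 2, and the step size `h` are illustrative choices, not part of the paper.

```python
import numpy as np

def point_energy(P, x):
    # Single-point energy term: sqrt(1 - ||P x||^2) for a unit vector x.
    return np.sqrt(1.0 - np.sum((P @ x) ** 2))

def analytic_grad(P, x):
    # Closed-form derivative with respect to P: -P x x^T / sqrt(1 - ||P x||^2).
    Px = P @ x
    return -np.outer(Px, x) / np.sqrt(1.0 - Px @ Px)

def numeric_grad(P, x, h=1e-6):
    # Central finite differences, one entry of P at a time.
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        E = np.zeros_like(P)
        E[idx] = h
        G[idx] = (point_energy(P + E, x) - point_energy(P - E, x)) / (2 * h)
    return G

rng = np.random.default_rng(0)
D, d = 5, 2
P = np.linalg.qr(rng.standard_normal((D, d)))[0].T  # d x D, orthonormal rows
x = rng.standard_normal(D)
x /= np.linalg.norm(x)                              # point on the unit sphere
max_gap = np.max(np.abs(analytic_grad(P, x) - numeric_grad(P, x)))
```

A small `max_gap` indicates that the closed form and the numerical derivative agree for this random instance; it is a spot check, not a proof.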



In view of the above calculations, the algorithm proceeds by picking a
point $x^*$ at random from the set, and then deciding which $P_{i^*}$
that point currently belongs to. Then it applies the update
$P_{i^*} \mapsto P_{i^*} - dt \cdot d_{x^*} P_{i^*}$, where $dt$ (the
"time step") is a parameter chosen by the user. It repeats this process
until some convergence criterion is met, and assigns the data points to
their nearest subspaces $\{P_i\}_{i=1}^K$ to obtain the K clusters. This
is summarized in Algorithm 1.



Algorithm 1 Median K-flats (MKF)

Input: $X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^D$: data,
normalized onto the unit sphere; d: dimension of subspaces; K: number of
subspaces; $\{P_i\}_{i=1}^K$: the initialized subspaces; dt: step
parameter.

Output: A partition of X into K disjoint clusters $\{X_i\}_{i=1}^K$.

Steps:

1. Pick a random point $x^*$ in X.

2. Find its closest subspace $P_{i^*}$, where
   $i^* = \arg\max_{1 \le i \le K} \|P_i x^*\|$.

3. Compute $d_{x^*} P_{i^*}$ by Eq. (8).

4. Update $P_{i^*}$: $P_{i^*} \leftarrow P_{i^*} - dt \cdot d_{x^*} P_{i^*}$.

5. Orthogonalize $P_{i^*}$.

6. Repeat steps 1-5 until convergence.^1

7. Assign each $x_j$ to the nearest subspace.
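
The steps of Algorithm 1 can be sketched in Python with NumPy. This is a hedged illustration rather than the authors' code: it assumes the sign convention in which $d_x P = -(P x x^T - P x x^T P^T P)/\sqrt{1 - \|P x\|^2}$, so that subtracting $dt \cdot d_x P$ moves the subspace toward the sampled point. It also replaces the convergence test by a fixed iteration count `n_iter`, uses a reduced QR factorization for the orthogonalization step, starts from random subspaces, and guards the denominator with a small `eps`. All names are hypothetical.

```python
import numpy as np

def orthonormalize(P):
    # Re-orthonormalize the rows of a d x D matrix via reduced QR.
    Q, _ = np.linalg.qr(P.T)
    return Q.T

def mkf(X, K, d, dt=0.01, n_iter=20000, seed=0, eps=1e-12):
    """Sketch of Median K-flats; rows of X (N x D) are assumed unit-norm."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    P = [orthonormalize(rng.standard_normal((d, D))) for _ in range(K)]
    for _ in range(n_iter):
        x = X[rng.integers(N)]                                    # step 1
        i = int(np.argmax([np.linalg.norm(Pi @ x) for Pi in P]))  # step 2
        Px = P[i] @ x
        dist = np.sqrt(max(1.0 - Px @ Px, eps))   # guarded denominator (assumption)
        # Step 3: P x x^T - P x x^T P^T P = outer(Px, x - P^T P x).
        dxP = -np.outer(Px, x - P[i].T @ Px) / dist
        P[i] = orthonormalize(P[i] - dt * dxP)                    # steps 4-5
    # Step 7: the nearest subspace maximizes the projection norm ||P_i x||.
    proj = np.stack([np.linalg.norm(X @ Pi.T, axis=1) for Pi in P], axis=1)
    return np.argmax(proj, axis=1), P
```

Only the K matrices and the current sample are held inside the loop, which is what makes the procedure usable in a streaming setting; the final assignment step is the only part that touches the whole data set.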



2.2. Complexity and storage of the algorithm 

Note that the data set does not need to be kept in memory, so the
storage requirement of the algorithm is $O(K \cdot d \cdot D)$, due to
the K $d \times D$ matrices $\{P_i\}_{i=1}^K$.

Finding the nearest subspace to a given point costs
$O(K \cdot d \cdot D)$ operations. Computing the update costs
$O(d \cdot D)$, and orthogonalizing $P_{i^*}$ costs $O(d^2 \cdot D)$.
Consequently, each iteration is $O(K \cdot d \cdot D + d^2 \cdot D)$. If
$n_s$ denotes the number of sampling iterations performed, then the
total running time of the MKF algorithm is
$O(n_s \cdot K \cdot d \cdot D + n_s \cdot d^2 \cdot D)$.

In our experiments we use dt = 0.01. With this choice, the number of
sampling iterations $n_s$ is typically about $10^4$. Usually $n_s$
increases as the data becomes more complex (i.e., more flats, more
outliers, etc.), but in our experiments it never exceeded
$3 \cdot 10^4$.
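
To give a feel for these orders of magnitude, here is a worked count for hypothetical values K = 3, d = 4, D = 60 and $n_s = 10^4$ (these numbers are chosen for illustration only and ignore constant factors):

```latex
K\,d\,D + d^2 D = 3 \cdot 4 \cdot 60 + 4^2 \cdot 60 = 720 + 960 = 1680,
\qquad
n_s \left( K\,d\,D + d^2 D \right) = 10^4 \cdot 1680 \approx 1.7 \times 10^7 .
```

The corresponding storage is $K \cdot d \cdot D = 720$ numbers, independent of the number of data points.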

^1 In our experiments we checked the energy functional of Eq. (3) every
1000 iterations. We stopped if the ratio between the current energy and
the previous one was in the range (0.999, 1.001). However, the
computation of the energy functional depends on the size of the data.
For large data sets we can obtain an online algorithm by replacing the
ratio of the energy functionals with, e.g., the sum of squares of sines
of principal angles between the corresponding subspaces.



2.3. Initialization 

Although the algorithm often works well with a random initialization of
$\{P_i\}_{i=1}^K$, it can often be improved with a more careful
initialization. We propose a farthest insertion method in Algorithm 2
below.



Algorithm 2 Initialization for $\{P_i\}_{i=1}^K$

Input: $X = \{x_1, x_2, \ldots, x_N\} \subset \mathbb{R}^D$: data;
d: dimension; K: number of d-flats.

Output: $\{P_i\}_{i=1}^K$: K subspaces.

For i = 1 to K, do

  • If i = 1, pick a random point x in X; otherwise pick the point x
    with the largest distance from the available planes
    $\{P_1, P_2, \ldots, P_{i-1}\}$.

  • Find the smallest integer j such that

        dim(span(jNN(x) - x)) = d,

    where jNN(x) denotes the set of j-nearest neighbors of x.

  • Let $P_i$ be the affine space spanned by x and jNN(x).

end
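
A possible reading of Algorithm 2 is sketched below with NumPy. Two points are assumptions on our side: "largest distance from the available planes" is interpreted as the largest distance to the nearest flat found so far, and the dimension test uses a numerical rank with tolerance `tol`. Each flat is returned as a pair (base point, orthonormal basis of directions); the function name is hypothetical, and the sketch assumes N > d.

```python
import numpy as np

def farthest_insertion_init(X, K, d, seed=0, tol=1e-10):
    """Sketch of the farthest insertion initialization; X is N x D."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    flats = []
    for i in range(K):
        if i == 0:
            x = X[rng.integers(N)]
        else:
            # Distance of every point to its nearest available flat (assumption).
            dists = np.min(
                [np.linalg.norm((X - c) - (X - c) @ B.T @ B, axis=1)
                 for c, B in flats],
                axis=0,
            )
            x = X[np.argmax(dists)]
        # Points sorted by distance to x; index 0 is x itself.
        order = np.argsort(np.linalg.norm(X - x, axis=1))
        for j in range(d, N):
            diffs = X[order[1:j + 1]] - x        # jNN(x) - x
            if np.linalg.matrix_rank(diffs, tol=tol) == d:
                break                            # smallest j with dimension d
        # Orthonormal basis of span(jNN(x) - x): top d right singular vectors.
        _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
        flats.append((x, Vt[:d]))
    return flats
```

For the linear-subspace setting of MKF one would keep only the basis of each flat and discard the base point; that simplification is also an assumption of this sketch.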



If the data has little noise and few outliers, then empir- 
ically, this initialization greatly increases the likelihood of 
obtaining the correct subspaces. On the other hand, in the 
case of sufficiently large noise or outliers, the initialization 
of Algorithm 2 does not work significantly better than ran- 
dom initializations, since the local structure of the data is 
obscured. 

Notice that the initialization of Algorithm 2 also works for affine
subspaces, so we can use it to initialize other iterative methods, such
as K-flats.

2.4. Some implementation odds and ends 

Because the algorithm is randomized and the objective function may have
many local minima, it is useful to restart the algorithm several times,
as is often practiced with the K-flats algorithm. We can choose the best
set of flats over all the restarts, measured either in the $\ell_1$
sense or in the $\ell_2$ sense, depending on the application.

The MKF algorithm we have presented is designed for data sampled from
linear subspaces of the same dimension. For affine subspaces, similarly
to [21], we can add a homogeneous coordinate so that subspaces become
linear. Empirically, it works well for clean cases with little noise or
few outliers. However, we are still working on the true affine model, to
make the algorithm more accurate and robust.
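
The homogeneous-coordinate device can be illustrated with a few lines of NumPy. The function name is hypothetical; the point of the sketch is that an affine d-flat in $\mathbb{R}^D$ becomes part of a linear subspace of dimension d + 1 in $\mathbb{R}^{D+1}$ once a constant coordinate is appended.

```python
import numpy as np

def add_homogeneous_coordinate(X):
    # Append a constant 1 to every row of X (N x D), giving an N x (D + 1) array.
    return np.hstack([X, np.ones((X.shape[0], 1))])

# Example: points on the affine line {(t, 1)} in R^2, which misses the origin.
t = np.linspace(-1.0, 1.0, 7)
X = np.column_stack([t, np.ones_like(t)])
Xh = add_homogeneous_coordinate(X)   # rows (t, 1, 1)
rank_lifted = np.linalg.matrix_rank(Xh)
```

Here the lifted points span a 2-dimensional linear subspace of $\mathbb{R}^3$. Before running MKF on lifted data, the rows would still need to be normalized onto the unit sphere, and the subspace dimension passed to the algorithm would be d + 1.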

Also, for mixed dimensions of subspaces, i.e., when the dimensions
$d_1, d_2, \ldots, d_K$ are not identical, we can set d to be
$\max(d_1, d_2, \ldots, d_K)$ to implement the MKF algorithm (similarly
to [4]). Experiments show that this method works well if the differences
among $\{d_i\}_{i=1}^K$ are comparatively small.



3. Simulation and experimental results 

In this section, we conduct experiments on artificial and 
real data sets to verify the effectiveness of the proposed 
MKF algorithm in comparison to other hybrid linear mod- 
eling algorithms. 

We measure the accuracy of those algorithms by the rate of misclassified
points with outliers excluded, that is

$$\text{error}\% = \frac{\#\text{ of misclassified inliers}}{\#\text{ of total inliers}} \times 100\%. \tag{9}$$
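
Since cluster labels produced by an algorithm are only defined up to a permutation, computing this rate requires matching predicted clusters to ground-truth ones. The text does not say how this matching is done; the sketch below assumes the common convention of minimizing over all K! relabelings, and assumes outliers are marked by the label -1 in the ground truth. The function name is hypothetical.

```python
from itertools import permutations

def misclassification_rate(true_labels, pred_labels, K, outlier=-1):
    """Percentage of misclassified inliers, minimized over cluster relabelings."""
    # Keep only inliers, i.e. points whose true label is not the outlier mark.
    pairs = [(t, p) for t, p in zip(true_labels, pred_labels) if t != outlier]
    best = len(pairs)
    for perm in permutations(range(K)):
        wrong = sum(1 for t, p in pairs if perm[p] != t)
        best = min(best, wrong)
    return 100.0 * best / len(pairs)

# Four inliers and one outlier; one inlier ends up in the wrong cluster.
rate = misclassification_rate([0, 0, 1, 1, -1], [1, 1, 0, 1, 0], K=2)  # 25.0
```

Enumerating permutations is only practical for small K such as the values 2 to 4 that are typical of this kind of experiment; for larger K an assignment solver would be the usual replacement.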



3.1. Simulations 



We compare MKF with the following algorithms: Mixtures of PPCA
(MoPPCA) [17], K-flats (KF) [8] (implemented for linear subspaces),
Local Subspace Affinity (LSA) [23], Spectral Curvature Clustering
(SCC) [4] (we use its version for linear subspaces, LSCC) and GPCA with
voting (GPCA) [24, 14]. We use the Matlab codes of the GPCA, MoPPCA and
KF algorithms from http://perception.csl.uiuc.edu/gpca, the LSCC
algorithm from http://www.math.umn.edu/~lerman/scc and the LSA algorithm
from http://www.vision.jhu.edu/db. The code for the MKF algorithm
appears in the supplementary webpage of this paper. It has been applied
with the default value of dt = 0.01.

The MoPPCA algorithm is always initialized with a random guess of the
membership of the data points. The LSCC algorithm is initialized by
randomly picking $100 \times K$ (d + 1)-tuples (following [4]). On the
other hand, KF and MKF are initialized with both a random guess (they
are denoted in this case by KF(R) and MKF(R) respectively) as well as
the initialization suggested by Algorithm 2 (and then denoted by KF and
MKF). We have used 10 restarts for MoPPCA, 30 restarts for KF, 5
restarts for MKF and 3 restarts for LSCC, and recorded the
misclassification rate of the one with the smallest $\ell_2$ error
(Eq. (1)) for MoPPCA, LSCC as well as KF, and $\ell_1$ error (Eq. (3))
for MKF. The number of restarts was restricted by the running time.

The simulated data represents various instances of K linear subspaces in
$\mathbb{R}^D$. If their dimensions are fixed and equal d, we follow [4]
and refer to the setting as $d^K \in \mathbb{R}^D$. If they are mixed,
then we follow [14] and refer to the setting as
$(d_1, \ldots, d_K) \in \mathbb{R}^D$. Fixing K and d (or
$d_1, \ldots, d_K$), we randomly generate 100 different instances of
corresponding hybrid linear models according to the code in
http://perception.csl.uiuc.edu/gpca. More precisely, for each of the 100
experiments, K linear subspaces of the corresponding dimensions in
$\mathbb{R}^D$ are randomly generated. Within each subspace the
underlying sampling distribution is a cross product of a uniform
distribution along a d-dimensional cube of sidelength 2 in that subspace
centered at the origin and a Gaussian distribution in the orthogonal
direction centered at the corresponding origin whose covariance matrix
is scalar with $\sigma$ = 5% of the diameter of the cube, i.e.,
$2\sqrt{d}$. Then, for each subspace 250 samples are generated according
to the distribution just described. Next, the data is further corrupted
with 5% or 30% uniformly distributed outliers in a cube of sidelength
determined by the maximal distance of the former 250 samples to the
origin (using the same code). The mean (along 100 instances)
misclassification rate of the various algorithms is recorded in Table 1,
and the corresponding standard deviation in Table 3. The mean running
time is shown in Table 2.
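
The sampling procedure just described can be approximated by the following NumPy sketch. It is not the code from the cited webpage, and several details are assumptions: $\sigma$ is treated as a standard deviation, the outlier percentage is taken relative to the number of inliers, and the outlier cube is $[-r, r]^D$ with r the maximal distance of the inliers to the origin. The function name and its parameters are hypothetical.

```python
import numpy as np

def sample_hybrid_linear_model(K, d, D, n_per=250, noise=0.05,
                               outlier_frac=0.05, seed=0):
    """Sketch: K random d-dimensional linear subspaces in R^D plus outliers."""
    rng = np.random.default_rng(seed)
    sigma = noise * 2.0 * np.sqrt(d)       # 5% of the cube diameter 2*sqrt(d)
    points, labels = [], []
    for k in range(K):
        Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
        B, B_perp = Q[:, :d], Q[:, d:]     # subspace basis and its complement
        coeffs = rng.uniform(-1.0, 1.0, size=(n_per, d))      # cube of sidelength 2
        ortho = sigma * rng.standard_normal((n_per, D - d))   # orthogonal noise
        points.append(coeffs @ B.T + ortho @ B_perp.T)
        labels.append(np.full(n_per, k))
    X = np.vstack(points)
    n_out = int(round(outlier_frac * len(X)))
    r = np.max(np.linalg.norm(X, axis=1))  # maximal inlier distance to the origin
    outliers = rng.uniform(-r, r, size=(n_out, D))
    y = np.concatenate(labels + [np.full(n_out, -1)])          # -1 marks outliers
    return np.vstack([X, outliers]), y
```

For example, `sample_hybrid_linear_model(K=2, d=2, D=5, n_per=100, outlier_frac=0.3)` returns 200 inliers followed by 60 outliers. Before passing such data to MKF, the rows would be normalized onto the unit sphere.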

From Table 1 we can see that MKF performs well in various instances of
hybrid linear modeling (with linear subspaces), and its advantage is
especially clear with many outliers and high dimensions. The
initialization of MKF
with Algorithm 2 does not work as well as random initial- 
ization. This is probably because both the noise level and 
the outlier percentage are too large for the former initializa- 
tion, which is based on only a few nearest neighbors. Nev- 
ertheless, we still notice that this initialization reduces the 
running time of both KF and MKF. 

We conclude from Table 2 that the running time of the 
MKF algorithm is not as sensitive to the size of dimensions 
(either ambient or intrinsic) as the running time of some 
other algorithms such as GPCA, LSA and LSCC. 

Table 3 indicates that GPCA and MoPPCA usually have 
a larger standard deviation of misclassification rate, whereas 
other algorithms have a smaller and comparable such stan- 
dard deviation, and are thus more stable. However, apply- 
ing either KF or MKF without restarts would result in large 
standard deviation of misclassification rates due to conver- 
gence to local minima. 

3.2. Applications 

We apply the MKF algorithm to the Hopkins 155 database of motion
segmentation [19], which is available at
http://www.vision.jhu.edu/data/hopkins155. This data contains 155 video
sequences along with the coordinates of certain features extracted and
tracked for each sequence in all its frames. The main task is to cluster
the feature vectors (across all frames) according to the different
moving objects and background in each video.

More formally, for a given video sequence, we denote the number of
frames by F. In each sequence, we have either


Table 1. Mean percentage of misclassified points in simulation. The MKF
or KF algorithm with random initialization are denoted by MKF(R) and
KF(R) respectively.

     GPCA       KF    KF(R)      LSA     LSCC      MKF   MKF(R)   MoPPCA
     28.2      7.8      8.3     42.6      6.7      9.6      7.6     21.7
     43.5     30.2     32.8     46.1     13.4     18.8     17.6     45.3
     10.5      2.2      2.2     10.6      2.0      2.0      2.0      7.5
     34.9     15.4     15.9     12.0      2.4      2.1      2.0     24.3
     14.9      4.8      4.8     21.1      4.1      4.0      3.9     17.4
     47.8     27.7     30.8     26.5      5.7      7.0      9.7     40.3
      5.4      0.6      0.5      7.0      0.3      0.1      0.2      4.6
     42.3     34.8     28.8      8.9      0.3      0.1      0.1     36.4
     13.0      2.2      2.2     13.1      1.1      0.2      0.2     11.9
     45.1     43.4     41.7     16.6      9.5      0.3      0.3     41.7
     19.8      9.1     11.0     29.6      9.8     19.2     17.6     18.1
     32.1     25.2     26.3     31.3     14.9     17.2     17.1     30.1
      5.8      0.8      0.9      5.8      1.4      0.9      1.1      9.4
     43.0     26.7     25.4      6.7     21.8      0.7      0.7     36.1

Table 2. Mean running time (in seconds) in simulation.

LSA   LSCC   MKF   MoPPCA

1.3 1.7 47.7 7.1
7.7 1.5
6.1 6.5
1.7
8.3 8.6
24.8 0.7
25.5 16.3 6.6 9.0
46.0 1.0 1.2 47.9 19.6 10.0 12.4 1.2
29.3 1.3 1.4 31.6 33.5 12.1 15.3 1.6
53.6 1.3 1.6 59.5 39.4 18.7 20.7 1.7
20.1 0.7
28.7 6.5 6.7 7.5
40.2 1.2 1.4 56.7 7.5 6.8 7.3 1.5
43.7 1.0 1.0 43.2 14.0 7.0 10.5 1.0
81.0 1.8 1.9 82.3 17.3 10.7 13.6 2.0



(LSA) [23], Multi Stage Learning (MSL) [16], and Random Sample Consensus
(RANSAC) [6, 18, 19].

We only directly applied KF and MKF, while for the other algorithms we
copy the results from http://www.vision.jhu.edu/data/hopkins155 (they
are based on experiments reported in [19] and [7]).

Since the database contains 155 data sets, we just record the mean
misclassification rate and the median misclassification rate for each
algorithm for any fixed K (two- or three-motions) and for the different
types of motions ("checker", "traffic" and "articulated") as well as the
total database.

We use 5 restarts for MKF and 20 restarts for KF and record the best
segmentation result (both based on mean squared error). For MKF we use
the default value of dt = 0.01. Due to the randomness of both MKF and
KF, we applied them 100 times and recorded the mean and median of
misclassification rates for both two-motions and three-motions (see
Table 4 and Table 5). We first applied both KF and MKF to the full data
(with ambient dimension 2F). We applied KF and MKF with the
initialization of Algorithm 2 as well as random initialization (and then
used the notation KF(R) and MKF(R)). For the purpose of comparison with
other algorithms (which could not be applied to the full dimension), we
also apply both KF and MKF to the data with reduced dimensions: 5 and 4K
(obtained by pro-



one or two independently moving objects, and the background can also
move due to the motion of the camera. We let K be the number of moving
objects plus the background, so that K is 2 or 3 (and distinguish
accordingly between two-motions and three-motions). For each sequence,
there are also N feature points $y_1, y_2, \ldots, y_N \in \mathbb{R}^3$
that are detected on the objects and the background. Let
$z_{ij} \in \mathbb{R}^2$ be the coordinates of the feature point $y_j$
in the $i$th image frame for every $1 \le i \le F$ and $1 \le j \le N$.
Then $z_j = [z_{1j}, z_{2j}, \ldots, z_{Fj}] \in \mathbb{R}^{2F}$ is the
trajectory of the $j$th feature point across the F frames. The actual
task of motion segmentation is to separate these trajectory vectors
$z_1, z_2, \ldots, z_N$ into K clusters representing the K underlying
motions.
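
The construction of the trajectory vectors can be made concrete with a short NumPy sketch. It assumes the tracked coordinates are stored in an array `Z` of shape (F, N, 2), with `Z[i, j]` holding $z_{ij}$ (zero-based indices); this storage layout and the function name are illustrative assumptions, not the format of the Hopkins 155 files.

```python
import numpy as np

def trajectory_vectors(Z):
    # Z[i, j] is the 2D image position of feature j in frame i.
    # Row j of the result is [z_1j, z_2j, ..., z_Fj], a vector in R^(2F).
    F, N, _ = Z.shape
    return Z.transpose(1, 0, 2).reshape(N, 2 * F)

# Tiny example with F = 3 frames and N = 2 features.
Z = np.arange(12).reshape(3, 2, 2)
T = trajectory_vectors(Z)   # shape (2, 6)
```

The rows of `T` are the points that the clustering algorithms operate on, so the ambient dimension is D = 2F.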

It has been shown [5] that under affine camera models and with some mild
conditions, the trajectory vectors corresponding to different moving
objects and the background across the F image frames live in distinct
linear subspaces of dimension at most four in $\mathbb{R}^{2F}$.
Following this theory, we implement both the MKF and KF algorithms with
d = 4.

We compare the MKF with the following algorithms: Connected Component
Search (CCS) [7], improved GPCA for motion segmentation (GPCA) [22],
K-flats (KF) [8] (implemented for linear subspaces), Local Linear
Manifold Clustering (LLMC) [7], Local Subspace Affinity



Table 3. Standard deviation of misclassification rate in simulation.

25.9
13.0
1.1 0.6
0.5 12.7
36.7 21.0 27.4 21.7 22.4 21.4 21.5 33.8
34.4 45.2 41.4 20.6 20.7 32.4 30.2 36.4
28.4 1.4 3.6 5.1 5.8 1.4 2.0 37.1
32.1 37.0 40.5 5.7 27.5 1.0 1.1 25.1



Table 4. The mean and median percentage of misclassified points for
two-motions in Hopkins 155 database. We use 5 restarts for MKF and 20
for KF, and the smallest of the $\ell_2$ errors is used. By MKF(R) and
KF(R) we mean the corresponding algorithm with random initialization.

2-motion       Checker         Traffic       Articulated          All
              Mean  Median    Mean  Median    Mean  Median    Mean  Median
CCS          16.37   10.64    5.27    0.00   17.58    7.07   12.16    0.00
GPCA          6.09    1.03    1.41    0.00    2.88    0.00    4.59    0.38
KF            5.33    0.04    2.36    0.00    3.83    1.11    4.43    0.00
KF 4K         5.81    0.17    3.55    0.02    4.97    1.15    5.15    0.06
KF 5         11.35    5.47    4.57    1.43   12.47    5.54    9.70    3.65
KF(R)        15.37    6.96   15.93    8.61   12.73    6.63   15.27    7.29
LLMC 4K       4.65    0.11    3.65    0.33    5.23    1.30    4.44    0.24
LLMC 5        4.37    0.00    0.84    0.00    6.16    1.37    3.62    0.00
LSA 4K        2.57    0.27    5.43    1.48    4.10    1.22    3.45    0.59
LSA 5         8.84    3.43    2.15    1.00    4.66    1.28    6.73    1.99
MKF           3.70    0.00    0.90    0.00    6.80    0.00    3.26    0.00
MKF 4K        4.51    0.01    1.59    0.00    6.08    0.92    3.90    0.00
MKF 5         9.37    4.10    3.47    0.00   10.68    5.84    7.97    2.39
MKF(R)       29.06   31.34   16.78   12.49   25.55   27.54   25.57   28.31
MSL           4.46    0.00    2.23    0.00    7.23    0.00    4.14    0.00
RANSAC        6.52    1.75    2.55    0.21    7.25    2.64    5.56    1.18



jecting onto the subspace spanned by the top 5 or 4K right singular
vectors of the SVD). We denote the corresponding application by KF 5,
MKF 5, KF 4K and MKF 4K. The same naming convention was used for LSA and
LLMC. Table 4 and Table 5 report the results for two-motions and
three-motions respectively.
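
The dimension reduction used for the "5" and "4K" variants can be sketched as follows, assuming NumPy and a data matrix whose rows are the trajectory vectors (so that the right singular vectors live in the ambient space). The orientation of the data matrix and the function name are assumptions of this sketch.

```python
import numpy as np

def reduce_dimension(X, r):
    # Project the rows of X (N x D) onto the span of the top r right
    # singular vectors of X; the result has shape N x r.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:r].T

# Rank-2 data in R^6: reducing to r = 2 loses nothing.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 6))
Y = reduce_dimension(X, 2)
```

In the motion segmentation setting r would be 5 or 4K, with K the number of motions, and r must not exceed min(N, D).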

From Tables 4 and 5 we can see that MKF (with the initialization of
Algorithm 2) works well for the given data. In particular, it exceeds
the performance of many other algorithms, even though they are more
complex. The clear advantage of the initialization of Algorithm 2 is
probably due to the cleanness of the data. It is interesting that even
though the data has low intrinsic dimensions, little noise and few
outliers, MKF is still superior to KF. This might be due to better
convergence of the MKF algorithm to a global minimum of the $\ell_1$
energy, whereas KF might get trapped in a local and non-global minimum
more often.

The error rates of MKF and KF are very stable. Indeed, the standard
deviation of misclassification rate from MKF is always less than 0.002
for two-motions and less than 0.013 for three-motions.

4. Conclusion and future work 

We have introduced the Median K-flats (MKF) algorithm, an online
algorithm aiming to approximate a data set by K best $\ell_1$ d-flats.
It is implemented with a stochastic gradient descent procedure which is
experimentally fast. The computational complexity is of order
$O(n_s \cdot K \cdot d \cdot D + n_s \cdot d^2 \cdot D)$, where $n_s$ is
the number of sampling iterations (typically about $10^4$; in all
experiments performed here it did not exceed $3 \cdot 10^4$), and
storage of the MKF algorithm is



Table 5. The mean and median percentage of misclassified points for
three-motions in Hopkins 155 database. We use 5 restarts for MKF and 20
for KF, and the smallest of the $\ell_2$ errors is used. By MKF(R) and
KF(R) we mean the corresponding algorithm with random initialization.

3-motion       Checker         Traffic       Articulated          All
              Mean  Median    Mean  Median    Mean  Median    Mean  Median
CCS          28.63   33.21    3.02    0.18   44.89   44.89   26.18   31.74
GPCA         31.95   32.93   19.83   19.55   16.85   28.66   28.66   28.26
KF           15.61   11.26    5.63    0.57   13.55   13.55   13.50    6.53
KF 4K        16.12   11.37    7.06    0.75   16.66   16.66   14.34    7.11
KF 5         26.95   31.88    8.09    5.67   17.65   17.65   22.65   25.08
KF(R)        21.83   24.52    8.70    5.00   15.85   15.85   18.86   17.81
LLMC 4K      12.01    9.22    7.79    5.47    9.38    9.38   11.02    6.81
LLMC 5       10.70    9.21    2.91    0.00    5.60    5.60    8.85    3.19
LSA 4K        5.80    1.77   25.07   23.79    7.25    7.25    9.73    2.33
LSA 5        30.37   31.98   27.02   34.01   23.11   23.11   29.28   31.63
MKF          14.50   12.00    3.06    0.01   15.90   15.90   12.29    6.23
MKF 4K       14.26   10.85    3.17    0.00   15.68   15.68   12.12    5.02
MKF 5        24.77   25.85    9.47    5.82   21.19   21.19   21.51   21.39
MKF(R)       41.17   41.69   21.38   17.19   41.36   41.36   37.22   39.58
MSL          10.38    4.61    1.80    0.00    2.71    2.71    8.23    1.76
RANSAC       25.78   26.01   12.83   11.45   21.38   21.38   22.94   22.03



of order $O(K \cdot d \cdot D)$. This algorithm performs well on
synthetic and real data distributed around mixtures of linear subspaces
of the same dimension d. It has a clear advantage over other studied
methods when the data has a large component of outliers and when the
intrinsic dimension d is large.

There is much work to be done. First of all, there are 
many possible practical improvements of the algorithm. In 
particular, we are interested in extending the MKF algo- 
rithm to affine subspaces by avoiding the normalization to 
the unit sphere (while incorporating the necessary algebraic 
manipulations) as well as improving the expected problem- 
atic convergence to the global minimum (due to many local 
minima in the case of affine subspaces) by better initializa- 
tions. We are also interested in exploring methods for deter- 
mining the number of clusters, K, the intrinsic dimension, 
d, and also developing strategies for mixed dimensions. 

Second of all, we would like to pursue further applica- 
tions of MKF. For example, we believe that it can be used 
advantageously for semi-supervised learning in the setting 
of hybrid linear modeling. We would also like to exploit its 
ability to deal with both substantially large and streaming 
data. 

Third of all, it will also be interesting to try to comparatively
analyze the convergence of the following algorithms: MKF to the global
minimum of the $\ell_1$ energy of Eq. (3), a straightforward $\ell_1$
version of the K-flats algorithm (assuming an accurate algorithm for
finding $\ell_1$ flats) to the global minimum of the same energy, and
K-flats to the global minimum of the $\ell_2$ energy.



Last of all, we are currently developing a theoretical framework
justifying the robustness of $\ell_1$ minimization for many instances of
our setting. This theory also identifies some cases where $\ell_1$ flats
are not robust to outliers and careful initializations are necessary for
MKF.